APPENDIX A



The following parameters are utilized by the AMB algorithm: 

Input: X - the given design matrix (continuous + categorical) (dimension: m x n, m = # 
of records, n = # of predictors); 

y - the dependent/target variable vector (dimension: m x 1) 

Output: s - the solution vector (the model parameter vector, including the "bias"
term) (dimension: (n1+1) x 1)

Step 0

For each continuous predictor 

If (there is any missing observation value) 
Perform Missing Value Substitution 

End 
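
Step 0 does not name the substitution method. The following is a minimal Python sketch that assumes mean substitution and uses None as the missing-value marker; both choices, and the function name substitute_missing, are assumptions made for illustration only.

```python
def substitute_missing(column):
    """Replace missing entries (None) with the mean of the observed values.

    Mean substitution is an assumption; the appendix does not name the method.
    """
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


# One missing observation in a continuous predictor.
filled = substitute_missing([2.0, None, 4.0, 6.0])
```

Median or model-based substitution would slot into the same place without changing the later steps.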

Step 1 

For each continuous predictor 

If (exponentially distributed) 

Log-scale the predictor and flag it 
End 

End 

Detect outliers 

End 

Step 2

// Perform Univariate Analysis for all n predictors 
If (size(continuous) > 0) 

For each continuous predictor 

Calculate its Pearson's r value (with the target) 

End 

End 

If (size(categorical) > 0) 

Bin the continuous target variable 

Calculate each categorical predictor's Cramer's V value (on the binned target groups)

End 

Sort continuous predictors by Pearson's R value
Sort categorical predictors by Cramer's V value
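
The two univariate statistics used in Step 2 can be sketched in plain Python as follows. The equal-width binning of the target and the bin count are assumptions (the appendix does not say how the target is binned), and the Cramer's V shown is the uncorrected form.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def bin_equal_width(values, bins=4):
    """Equal-width binning of a continuous variable (bin count is assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]


def cramers_v(a, b):
    """Cramer's V between two categorical sequences (no bias correction)."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    chi2 = 0.0
    for i in ca:
        for j in cb:
            expected = ca[i] * cb[j] / n
            chi2 += (joint.get((i, j), 0) - expected) ** 2 / expected
    k = min(len(ca), len(cb)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


target = [1, 2, 3, 4, 5, 6, 7, 8]
continuous = [2, 4, 6, 8, 10, 12, 14, 16]
categorical = ["a", "a", "a", "a", "b", "b", "b", "b"]

r = pearson_r(continuous, target)
v = cramers_v(categorical, bin_equal_width(target, bins=2))
```

In this toy data both predictors track the target perfectly, so both statistics come out at (numerically) 1.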

// Assume n = n_conti + n_cate, n_conti = # of continuous, n_cate = # of categorical
If n_conti > 200 

Retain top 135 + ((n_conti - 200)*0.3) (30% continuous with large R values) 
Else if 100 < n_conti <= 200 

Retain top 85 + ((n_conti - 100)*0.5) (50% continuous with large R values)
Else if 50 < n_conti <= 100

Retain top 50 + ((n_conti - 50)*0.7) (70% continuous with large R values)
Else // n_conti <=50 

Retain all predictors 

End 

If n_cate > 200 

Retain top 135 + ((n_cate - 200)*0.3) (30% categorical with large V values) 
Else if 100 < n_cate <= 200 

Retain top 85 + ((n_cate - 100)*0.5) (50% categorical with large V values)
Else if 50 < n_cate <=100 

Retain top 50 + ((n_cate - 50)* 0.7) (70% categorical with large V values) 
Else // n_cate <=50 

Retain all predictors 

End 
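
The tiered retention rule is the same for continuous and categorical predictors, so it can be written once. The sketch below is illustrative: rounding fractional counts down and ranking by absolute value are assumptions, and the names retain_count and retain_top are hypothetical.

```python
def retain_count(n):
    """Number of top-ranked predictors kept by the tiered rule in Step 2.

    Fractions are rounded down using integer arithmetic; the appendix does
    not say how fractional counts are rounded, so this is an assumption.
    """
    if n > 200:
        return 135 + (n - 200) * 3 // 10
    if n > 100:
        return 85 + (n - 100) * 5 // 10
    if n > 50:
        return 50 + (n - 50) * 7 // 10
    return n  # n <= 50: retain all predictors


def retain_top(scores):
    """Keep the highest-scoring predictors.

    `scores` maps predictor name -> Pearson's r or Cramer's V.
    Ranking by absolute value is an assumption.
    """
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:retain_count(len(ranked))]
```

The tiers join without gaps: 100 predictors yield 85 retained and 200 yield 135, which are exactly the base values of the next tier up.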
Step 3 

If (size(categorical) > 0 && size(continuous) > 0)

// Merge categorical with continuous (in favor of continuous) 
Categorize continuous predictors 

For each categorical predictor c1
For each continuous predictor c2

Compute the Cramer's V value between c1 and c2
If Cramer's V(c1, c2) > 0.5
Remove c1 from the retained list

End 

End 

End 

End 

If (size(categorical) > 0) 

Expand all retained categorical predictors into dummies 

End 
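
Dummy expansion of a retained categorical predictor can be sketched as follows. Whether a reference level is dropped is not stated in the appendix, so the drop_first switch is hypothetical and defaults to keeping every level.

```python
def expand_dummies(column, drop_first=False):
    """Expand one categorical column into 0/1 indicator columns.

    Returns (levels, rows), where rows[i][j] is 1 when record i has level j.
    `drop_first` is a hypothetical switch; the appendix does not say whether
    a reference level is dropped.
    """
    levels = sorted(set(column))
    if drop_first:
        levels = levels[1:]
    rows = [[1 if value == level else 0 for level in levels] for value in column]
    return levels, rows


levels, rows = expand_dummies(["red", "blue", "red", "green"])
```

Keeping every level makes the dummy columns of one predictor sum to the bias column, which is one reason a later collinearity check is useful.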

If (size(categorical) > 0 && size(continuous) > 0) 

Formulate the new design matrix X by combining retained categorical and continuous 
predictors 

End 

Step 4 

Normalize (not z-scaling) all retained predictors (X) and obtain the new design matrix X' 
Step 5 

Formulate the normal equation N = X'^T X' (matrix-matrix multiplication, dimension of N: n1 x n1)

// Filter out strongly collinear predictors

While there is an off-diagonal element of lower_triangle(X'^T X') with its absolute value > 0.8
// assume the index is (i, j) and i > j 

Compute the correlation r_i between the target and the ith predictor 
Compute the correlation r_j between the target and the jth predictor
If r_i > r_j

Remove jth predictor from the retained predictor list

Else

Remove ith predictor from the retained predictor list

End 

End 

If any predictor deletion (above) was performed

Reformulate the design matrix X' and the corresponding normal equation N = X'^T X'
(matrix-matrix multiplication)
[m, n1] = size(X')

End 
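
The collinearity filter of Step 5 can be sketched as below. The sketch assumes that the Step 4 normalization centers each column and scales it to unit length, so that the off-diagonal entries of X'^T X' are correlations comparable with the 0.8 threshold; the appendix does not spell out the normalization, so this is one reading only. All function names are hypothetical.

```python
import math


def unit_columns(columns):
    """Center each column and scale it to unit Euclidean length (assumed
    reading of "Normalize (not z-scaling)")."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        centered = [v - mean for v in col]
        norm = math.sqrt(sum(v * v for v in centered)) or 1.0
        out.append([v / norm for v in centered])
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def filter_collinear(columns, target, threshold=0.8):
    """Return the indices of the predictors kept after the Step 5 loop."""
    cols = unit_columns(columns)
    t = unit_columns([target])[0]
    kept = list(range(len(cols)))
    while True:
        # Find an off-diagonal entry (i > j) of X'^T X' above the threshold.
        pair = next(
            ((i, j) for i in kept for j in kept
             if i > j and abs(dot(cols[i], cols[j])) > threshold),
            None,
        )
        if pair is None:
            return kept
        i, j = pair
        # Compare the correlations with the target as written in the appendix.
        r_i, r_j = dot(cols[i], t), dot(cols[j], t)
        kept.remove(j if r_i > r_j else i)


x0 = [1, 2, 3, 4, 5]
x1 = [2, 4, 6, 8, 11]      # nearly collinear with x0
x2 = [1, -1, 1, -1, 1]     # unrelated to x0 and x1
kept = filter_collinear([x0, x1, x2], target=[1, 2, 3, 4, 5])
```

Here x1 is dropped because x0 correlates more strongly with the target, and x2 survives because it is nearly orthogonal to both.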
Step 6 

Perform PCA on N via SVD(N) and obtain the loading matrix M (dimension: n1 x n1)
and the latent vector l (dimension: n1 x 1)

Step 7

If PCA successful (i.e., the SVD in PCA does not fail)

Sort the latent vector l in increasing order and obtain the sorting index;

Use the singular values l and the sorting index to identify a few bottom components C (i.e., the last d columns of M, dimension: n1 x d) that represent 10% of the variance accounted for;

If (n1 - d < 10)

Reformulate C by including only the last d2 (= n1 - 10) columns of M
Reset d = d2

End 

Scan all columns/components in C and delete d1 (<= d) components that don't have
predictive strength, i.e., |Pearson's R(target, component)| < 0.3
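
Choosing the number d of bottom components can be sketched as follows. Two readings are assumed here and are not confirmed by the appendix: "10% of variance" is taken as "at most 10% of the total", and d is clamped at zero when fewer than ten components exist. The function name is hypothetical.

```python
def bottom_component_count(latent, share=0.10, min_kept=10):
    """Count the bottom components d whose cumulative variance is <= share.

    `latent` holds one variance (singular value of N) per component.
    """
    n1 = len(latent)
    total = sum(latent)
    d, running = 0, 0.0
    for value in sorted(latent):  # increasing order, smallest first
        if running + value > share * total:
            break
        running += value
        d += 1
    if n1 - d < min_kept:
        d = max(n1 - min_kept, 0)  # d2 = n1 - 10 in the appendix
    return d


few = bottom_component_count([500, 250, 100, 60, 40, 20, 10, 8, 5, 3, 2, 1, 0.5, 0.5])
many = bottom_component_count([100] * 10 + [1] * 20)
```

In the first call ten components fall under the 10% share, but only four may be considered because ten of the fourteen must stay untouched; in the second call all twenty small components qualify.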

Step 8

k = n1 - d1

Formulate the mapping matrix M' from M (by removing those d1 components,
dimension of M': n1 x k)
While (k >= m) 

Delete the bottom components according to the singular value 
End While 

Reset k to the size of remaining components 

Compute A' = X'M' (matrix-matrix multiplication, dimension of A': m x k) 

Step 9 

Append the "bias" column (all 1's) to A' as its (new) first column (dimension of A': m x
(k+1))

Pass A' to Engine (SVD + possibly a random initial guess and CGD) for component 
regression and generate a solution vector w (dimension: (k+1) x 1) 

Step 10 
// Map w back to the predictor space

- Compute the solution vector s = M' * w[2..k+1] (multiplication of matrix M' and a partial vector of w, from w[2] to w[k+1]) (dimension of s: n1 x 1)

- Add the "bias" term (i.e., w[1]) to s as its (new) first entry (dimension of s: (n1+1) x 1)

Else //PCA failed 
Step 11

Append the "bias" column to X' as its (new) first column (dimension of X': m x (n1+1))
While (n1+1 >= m)

Delete the remaining least correlated (with target) variable
End While

Reset n1+1 to the size of the retained design matrix

Pass all retained predictors X' to Engine (SVD + possibly a random initial guess and
CGD) for predictor regression and generate a solution vector s (dimension: (n1+1) x 1)
End 
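
The back-mapping of Step 10 is a single matrix-vector product followed by prepending the bias. A minimal sketch, using 0-based Python indexing (so w[0] here is w[1] in the appendix) and a hypothetical function name:

```python
def map_back(M_prime, w):
    """Map component weights back to the predictor space (Step 10).

    M_prime: n1 x k mapping matrix, given as a list of rows.
    w: length k+1, with the bias weight first.
    Returns s of length n1+1 with the bias as its first entry.
    """
    bias, component_weights = w[0], w[1:]
    s = [sum(m * c for m, c in zip(row, component_weights)) for row in M_prime]
    return [bias] + s


M_prime = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n1 = 3, k = 2
w = [0.5, 2.0, -1.0]
s = map_back(M_prime, w)
```

Because A' is the bias column followed by X'M', the fitted values from A'w equal those obtained by applying s to X' with a leading bias column, which is why the model can be reported directly in terms of the original predictors.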